{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "```\n",
        "# Copyright 2023 Google Inc.\n",
        "\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "\n",
        "#     http://www.apache.org/licenses/LICENSE-2.0\n",
        "\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License.\n",
        "```"
      ],
      "metadata": {
        "id": "Nc5z20tucbYc"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "This colab supports UniProt 2023_01, where Google predicted protein names for 10s of millions of proteins previously named \"Uncharacterized protein\".\n",
        "\n",
        "You can run this file to check whether any prediction (especially for previously \"Uncharacterized\" proteins) produced by Google's systems is supported by other sources.\n",
        "\n",
        "\n",
        "**Paste in the UniProt accession of the protein below!**"
      ],
      "metadata": {
        "id": "NrIKMuHM_pS2"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Imports and dependencies"
      ],
      "metadata": {
        "id": "yvGAanHAsf04"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!apt-get install pigz"
      ],
      "metadata": {
        "id": "o43XlscFsCIZ",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "f43c67b4-f917-4e87-9552-214e8abb43fd"
      },
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "The following NEW packages will be installed:\n",
            "  pigz\n",
            "0 upgraded, 1 newly installed, 0 to remove and 22 not upgraded.\n",
            "Need to get 57.4 kB of archives.\n",
            "After this operation, 259 kB of additional disk space will be used.\n",
            "Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 pigz amd64 2.4-1 [57.4 kB]\n",
            "Fetched 57.4 kB in 0s (167 kB/s)\n",
            "Selecting previously unselected package pigz.\n",
            "(Reading database ... 128275 files and directories currently installed.)\n",
            "Preparing to unpack .../archives/pigz_2.4-1_amd64.deb ...\n",
            "Unpacking pigz (2.4-1) ...\n",
            "Setting up pigz (2.4-1) ...\n",
            "Processing triggers for man-db (2.9.1-1) ...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install binary_file_search"
      ],
      "metadata": {
        "id": "YhaVGrgU98Qe",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "aafc2014-f447-48cc-b4e7-eebb10793f4f"
      },
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Collecting binary_file_search\n",
            "  Downloading binary_file_search-0.7.tar.gz (4.0 kB)\n",
            "  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "Building wheels for collected packages: binary_file_search\n",
            "  Building wheel for binary_file_search (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for binary_file_search: filename=binary_file_search-0.7-py3-none-any.whl size=3968 sha256=08db9047bad570a67c78614c5cfd8b1417b8ffc8e7bf4d24222a6608f59ee3bf\n",
            "  Stored in directory: /root/.cache/pip/wheels/4e/ad/a2/3a9a72f26e1b3dc30147de9f09a853f4e2c73c2683a71bba2d\n",
            "Successfully built binary_file_search\n",
            "Installing collected packages: binary_file_search\n",
            "Successfully installed binary_file_search-0.7\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "ymlaxuFFWFlE"
      },
      "outputs": [],
      "source": [
        "from binary_file_search.BinaryFileSearch import BinaryFileSearch\n",
        "import IPython.display\n",
        "def print_markdown(string):\n",
        "    IPython.display.display(IPython.display.Markdown(string))"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Download evidence file and unzip\n",
        "(takes a few minutes)"
      ],
      "metadata": {
        "id": "2NbvPJy2sjFf"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget -c -O sorted_evidencer.tsv.gz https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2023_01_evidencer_sorted.tsv.gz"
      ],
      "metadata": {
        "id": "DC0CFu7m9tDY",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "21d6db46-af8c-4ba7-f552-9944883da620"
      },
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2023-03-13 18:15:10--  https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2023_01_evidencer_sorted.tsv.gz\n",
            "Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.31.128, 172.253.62.128, 142.251.163.128, ...\n",
            "Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.31.128|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 317478150 (303M) [application/octet-stream]\n",
            "Saving to: ‘sorted_evidencer.tsv.gz’\n",
            "\n",
            "sorted_evidencer.ts 100%[===================>] 302.77M   166MB/s    in 1.8s    \n",
            "\n",
            "2023-03-13 18:15:12 (166 MB/s) - ‘sorted_evidencer.tsv.gz’ saved [317478150/317478150]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!unpigz -fk sorted_evidencer.tsv.gz"
      ],
      "metadata": {
        "id": "g7crO7tz919X"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Search for your evidence\n",
        "\n",
        "We search for evidence for a prediction via two alignment methods: sequence alignment via [**phmmer**](http://hmmer.org/) and structure alignment via [**tmalign**](https://zhanggroup.org/TM-score/).\n",
        "\n",
        "In particular, for a given prediction, we define as \"evidence\" a protein in UniProt that has a protein name matching our prediction and which has a similar amino acid sequence, as given by a phmmer score [above 25](https://hmmer-web-docs.readthedocs.io/en/latest/searches.html#significance-bit-scores) or a tmalign score [above 0.5](https://zhanggroup.org/TM-score/) (with a confident AlphaFold structure). We ignore proteins named by ProtNLM when searching for evidence in UniProt.\n",
        "\n",
        "When found, we provide one example evidence for each alignment method."
      ],
      "metadata": {
        "id": "cU4GrPdE_SfH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "accession = \"A0A3M3H8U9\" #@param {type:\"string\"}\n",
        "with BinaryFileSearch('sorted_evidencer.tsv', sep=\"\\t\", string_mode=True) as bfs:\n",
        "  try:\n",
        "    lines = bfs.search(accession)\n",
        "  except KeyError:\n",
        "    raise ValueError('Sorry, this protein\\'s accession wasn\\'t found in our database. Maybe check your spelling, or maybe this prediction wasn\\'t provided by Google?')\n",
        "\n",
        "if len(lines) > 1:\n",
        "  for l in lines:\n",
        "    print(l)\n",
        "  raise ValueError('There was some sort of error - we found multiple predictions for this protein!', lines)\n",
        "\n",
        "\n",
        "if len(lines[0]) == 2:\n",
        "  accession, prediction = tuple(lines[0])\n",
        "  to_print = (f'The prediction **{prediction}** for **{accession}**: \\n'\n",
        "              f\"* **no support** found with these automated methods. Perhaps the protein is very new or the prediction is wrong?\")\n",
        "else:\n",
        "  accession, prediction, _, alignment_support, structure_support = tuple(lines[0])\n",
        "  to_print_prefix = f'The prediction **{prediction}** for **{accession}**: \\n'\n",
        "  to_print = to_print_prefix\n",
        "\n",
        "  if alignment_support:\n",
        "    to_add = f\"* has a strong phmmer alignment to **{alignment_support}** (bit score > 25).\\n\"\n",
        "    to_print += to_add\n",
        "  if structure_support:\n",
        "    to_add = f\"* has a structural alignment to the high-confidence AlphaFold structure for **{alignment_support}** (tmalign score > .5).\\n\"\n",
        "    to_print += to_add\n",
        "\n",
        "print_markdown(to_print)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 87
        },
        "cellView": "form",
        "id": "N17c3oC39_0e",
        "outputId": "7df1cbf5-8b50-42f2-b53c-ccb914d81ded"
      },
      "execution_count": 18,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ],
            "text/markdown": "The prediction **ATPase** for **A0A3M3H8U9**: \n* has a strong phmmer alignment to **A0A345UY82** (bit score > 25).\n* has a structural alignment to the high-confidence AlphaFold structure for **A0A345UY82** (tmalign score > .5).\n"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# License of downloaded data\n",
        "\n",
        "The evidencer file is licensed CC-BY 4.0 and is built based on UniProt.\n",
        "\n",
        "\"UniProt: the universal protein knowledgebase in 2021.\" Nucleic acids research 49, no. D1 (2021): D480-D489."
      ],
      "metadata": {
        "id": "jnCOhzbTJ1RM"
      }
    }
  ]
}